412 research outputs found

    Making Digital Artifacts on the Web Verifiable and Reliable

    The current Web has no general mechanism to make digital artifacts --- such as datasets, code, texts, and images --- verifiable and permanent. For digital artifacts that are supposed to be immutable, there is moreover no commonly accepted method to enforce this immutability. These shortcomings seriously impair the reproducibility of processes that rely on Web resources, which in turn affects areas such as science, where reproducibility is essential. To solve this problem, we propose trusty URIs containing cryptographic hash values. We show how trusty URIs can be used to verify digital artifacts in a manner that is independent of the serialization format in the case of structured data files such as nanopublications. We demonstrate how the contents of these files become immutable, including dependencies on external digital artifacts, thereby extending the range of verifiability to the entire reference tree. Our approach adheres to the core principles of the Web, namely openness and decentralized architecture, and is fully compatible with existing standards and protocols. Evaluation of our reference implementations shows that these design goals are indeed accomplished and that the approach remains practical even for very large files. Comment: Extended version of conference paper: arXiv:1401.577
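The core idea of a hash-based identifier can be illustrated with a short sketch. This is not the actual trusty URI specification (which defines its own modules and encoding for different artifact types); it simply appends a URL-safe SHA-256 digest of an artifact's bytes to its URI, so anyone holding the URI can later check that the content is unchanged. The base URI and byte strings below are hypothetical.

```python
import base64
import hashlib

def make_verifiable_uri(base_uri: str, content: bytes) -> str:
    """Derive a hash-suffixed URI from an artifact's bytes (illustrative only)."""
    digest = hashlib.sha256(content).digest()
    # URL-safe base64 without padding keeps the suffix URI-friendly.
    suffix = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return f"{base_uri}.{suffix}"

def verify(uri: str, content: bytes) -> bool:
    """Recompute the hash from the content and compare against the URI."""
    base_uri, _, _ = uri.rpartition(".")
    return make_verifiable_uri(base_uri, content) == uri

uri = make_verifiable_uri("http://example.org/artifact", b"dataset bytes")
assert verify(uri, b"dataset bytes")        # content unchanged
assert not verify(uri, b"tampered bytes")   # any modification is detected
```

Because the hash is part of the identifier itself, verification needs no trusted third party: whoever resolves the URI can check the bytes locally, which is what makes references to such artifacts immutable in practice.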

    Performance of the Charniak-Lease parser on biological text using different training corpora

    POS tagging is used as the first step in many NLP workflows, although the accuracy of tag assignment frequently goes unchecked. We hypothesize that changing the training corpora for a parser will affect its POS tagging of a target corpus. To this end, we train the Charniak-Lease parser on the WSJ corpus and on two biomedical corpora, and evaluate its output against MedPost, a POS tagger with a reported 97% accuracy on biomedical text. Our findings indicate that using biomedical training corpora significantly improves performance, but that minor differences between the biomedical training corpora have a significant effect on the correctness of POS tagging. Specifically, the tagging of hyphenated words and verbs was affected. This work suggests that the choice of training corpora is crucial to domain-targeted NLP analysis.
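The evaluation described above boils down to token-level agreement between two tag streams over the same tokens. A minimal sketch of that comparison, with hypothetical tags standing in for real MedPost and parser output:

```python
def tag_agreement(reference: list[str], candidate: list[str]) -> float:
    """Fraction of tokens where two POS-tag sequences agree."""
    if len(reference) != len(candidate):
        raise ValueError("tag sequences must align token for token")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)

# Hypothetical tags for a four-token sentence:
medpost_tags = ["NN", "VBZ", "JJ", "NNS"]  # stand-in for MedPost output
parser_tags = ["NN", "VBZ", "JJ", "NN"]    # stand-in for parser output
agreement = tag_agreement(medpost_tags, parser_tags)  # 0.75
```

In practice one would also break agreement down by tag class (e.g. verbs, hyphenated words) to surface exactly the kind of systematic differences the study reports.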

    A web API ecosystem through feature-based reuse

    The fast-growing web API landscape brings clients more options than ever before, at least in theory. In practice, they cannot easily switch between different providers offering similar functionality. We discuss a vision for developing web APIs based on the reuse of interface parts called features. Through the introduction of five design principles, we investigate the impact of feature-based reuse on web APIs. Applying these principles enables granular reuse of client and server code, documentation, and tools. Together, they can foster a measurable ecosystem with cross-API compatibility, opening the door to a more flexible generation of web clients.

    Advancing discovery science with FAIR data stewardship: Findable, accessible, interoperable, reusable

    This report summarizes a presentation by Dr. Michel Dumontier. It reviews innovative scientific research methods created by data science and the need to develop infrastructure, methodologies, and user communities to advance data science. Stakeholders have proposed a set of principles to make digital resources findable, accessible, interoperable, and reusable (FAIR). FAIR principles provide guidelines, do not require specific technologies, and allow communities of stakeholders to define specific FAIR standards and develop metrics to quantify them. Libraries can be part of the new data ecosystem by providing education, data stewardship, and infrastructure.

    A Web API ecosystem through feature-based reuse

    The current Web API landscape does not scale well: every API requires its own hardcoded clients in an unusually short-lived, tightly coupled relationship of highly subjective quality. This directly inflates development costs and prevents the design of a more intelligent generation of clients that provide cross-API compatibility. We introduce five principles to establish an ecosystem in which Web APIs consist of modular interface features with shared semantics, whose implementations can be reused by clients and servers across domains and over time. Web APIs and their features should be measured for effectiveness in a task-driven way. This enables an objective and quantifiable discourse on the appropriateness of a certain interface design for certain scenarios, and shifts the focus from creating interfaces for the short term to empowering clients in the long term.

    Putting FAIR Evidence into Practice


    NBLAST: a cluster variant of BLAST for NxN comparisons

    BACKGROUND: The BLAST algorithm compares biological sequences to one another to determine shared motifs and common ancestry. However, comparing all non-redundant (NR) sequences against all other NR sequences is a computationally intensive task. We developed NBLAST, a cluster-computer implementation of the BLAST family of sequence comparison programs, to generate pre-computed BLAST alignments and neighbour lists of NR sequences. RESULTS: NBLAST performs the heuristic BLAST algorithm and generates an exhaustive database of alignments, but it computes only the upper triangle of the N x N comparison matrix rather than all of the possible N² alignments, where N is the number of sequences to be compared. A task-partitioning algorithm distributes the work across all cluster nodes, and the NBLAST master process produces a BLAST sequence alignment database and a list of sequence neighbours for each sequence record. The resulting sequence alignment and neighbour databases serve the SeqHound query system through a C/C++ and Perl Application Programming Interface (API). CONCLUSIONS: NBLAST offers a local alternative to the NCBI's remote Entrez system for pre-computed BLAST alignments and neighbour queries. On our 216-processor 450 MHz PIII cluster, NBLAST requires ~24 hrs to compute neighbours for the 850,000 proteins currently in the non-redundant protein database.
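The upper-triangle idea can be sketched in a few lines. This is a toy stand-in for NBLAST's actual task-partitioning algorithm (the function name and round-robin assignment are assumptions for illustration): since BLAST comparisons are pairwise, only the unordered pairs (i, j) with i < j need to be computed, roughly halving the N x N workload, and those pairs can be dealt out across worker nodes.

```python
from itertools import combinations

def upper_triangle_tasks(n_seqs: int, n_workers: int) -> list[list[tuple[int, int]]]:
    """Assign each unordered pair (i, j), i < j, to a worker round-robin.

    Only the upper triangle of the N x N comparison matrix is generated,
    so N*(N-1)/2 pairs are computed instead of N*N.
    """
    tasks: list[list[tuple[int, int]]] = [[] for _ in range(n_workers)]
    for k, pair in enumerate(combinations(range(n_seqs), 2)):
        tasks[k % n_workers].append(pair)
    return tasks

tasks = upper_triangle_tasks(5, 2)
total_pairs = sum(len(t) for t in tasks)  # 5 * 4 // 2 = 10 pairs
```

A real scheduler would balance by estimated alignment cost rather than pair count, since sequence lengths vary widely, but the triangular enumeration is the same.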